Abstract

Some images that are difficult to recognize on their own may become
more clear in the context of a neighborhood of related images with
similar social-network metadata. We build on this intuition to improve
multilabel image annotation. Our model uses image metadata
nonparametrically to generate neighborhoods of related images using
Jaccard similarities, then uses a deep neural network to blend visual
information from the image and its neighbors. Prior work typically
models image metadata parametrically; in contrast, our nonparametric
treatment allows our model to perform well even when the vocabulary of
metadata changes between training and testing. We perform comprehensive
experiments on the NUS-WIDE dataset, where we show that our model
outperforms state-of-the-art methods for multilabel image annotation
even when our model is forced to generalize to new types of metadata.

1. Introduction 

Take a look at the image in Figure 1a. Might it be a flower petal, or a
piece of fruit, or perhaps even an octopus tentacle? The image on its
own is ambiguous. Take another look, but this time consider that the
images in Figure 1b share social-network metadata with Figure 1a. Now
the answer is clear: all of these images show flowers. The context of
additional unannotated images disambiguates the visual classification
task. We build on this intuition, showing improvements in multilabel
image annotation by exploiting image metadata to augment each image
with a neighborhood of related images.

Most images on the web carry metadata; the idea of using it to improve
visual classification is not new. Prior work takes advantage of user
tags for image classification and retrieval [19, 5, 23, 38], uses GPS
data [20, 35, 48] to improve image classification, and utilizes
timestamps [26] to both improve recognition and study topical evolution
over time. The motivation behind much of this work is the notion that
images with similar metadata tend to depict similar scenes.

One class of image metadata where this notion is particularly relevant
is social-network metadata, which can be harvested for images embedded
in social networks such as Flickr. These metadata, such as
user-generated tags and community-curated groups to which an image
belongs, are applied to images by people as a means to communicate with
other people; as such, they can be highly informative as to the
semantic contents of images. McAuley and Leskovec [37] pioneered the
study of multilabel image annotation using metadata, and demonstrated
impressive results using only metadata and no visual features
whatsoever.

Figure 1: On its own, the image in (a) is ambiguous: it might be a
flower petal, but it could also be a piece of fruit or possibly an
octopus tentacle. In the context of a neighborhood (b) of images with
similar metadata, it is more clear that (a) shows a flower. Our model
utilizes image neighborhoods to improve multilabel image annotation.

Despite its significance, the applicability of McAuley and Leskovec’s
method to real-world scenarios is limited due to the parametric method
by which image metadata is modeled. In practice, the vocabulary of
metadata may shift over time: new tags may become popular, new image
groups may be created, etc. An ideal method should be able to handle
such changes, but their method assumes identical vocabularies during
training and testing.

In this paper we revisit the problem of multilabel image annotation,
taking advantage of both metadata and strong visual models. Our key
technical contribution is to generate neighborhoods of images (as in
Figure 1) nonparametrically using image metadata, then to operate on
these neighborhoods with a novel parametric model that learns the
degree to which visual information from an image and its neighbors
should be trusted.

In addition to giving state-of-the-art performance on multilabel image
annotation (Section 5.1), this approach allows our model to perform
tasks that are difficult or impossible using existing methods.
Specifically, we show that our model can do the following:

• Handle different types of metadata. We show that the same model can
give state-of-the-art performance using three different types of
metadata (image tags, image sets, and image groups). We also show that
our model gives strong results when different metadata are available at
training time and testing time.

• Adapt to changing vocabularies. Our nonparametric approach to
handling metadata allows our model to handle different vocabularies at
train and test time. We show that our model gives strong performance
even when the training and testing vocabularies of user tags are
completely disjoint.

2. Related Work 

Automatic image annotation and image search. Our work falls in the
broad area of image annotation and search [34]. Harvesting images from
the web to train visual classifiers without human annotation is an idea
that has been explored many times in the past decade [14, 45, 32, 3,
43, 7, 10, 6]. Early work on image annotation used voting to transfer
labels between visually similar images, often using simple
nonparametric models [36, 33]. This strategy is well suited for
multimodal data and large vocabularies of weak labels, but is very
sensitive to the metric used to find visual neighbors. Extensions use
learnable metrics and weighted voting schemes [18, 44], or more
carefully select the training images used for voting [47]. Our method
differs from this work because we do not transfer labels from the
training set; instead we compute nearest neighbors between test-set
images using metadata.

These approaches have shown good results, but are limited because they
treat tags and visual features separately, and may be biased towards
common labels. Some authors instead tackle multilabel image annotation
by learning parametric models over visual features that can make
predictions [17, 45, 49, 15] or rank tags [29]. Gong et al. [15]
recently showed state-of-the-art results on NUS-WIDE [8] using CNNs
with multilabel ranking losses. These methods typically do not take
advantage of image metadata.

Multimodal representation learning: images and tags. A common approach
for utilizing image metadata is to learn a joint representation of
images and tags. To this end, prior work generatively models the
association between visual data and tags or labels [30, 2, 4, 40] or
applies nonnegative matrix factorization to model this latent structure
[50, 13, 25]. Similarly, Niu et al. [38] encode the text tags as
relations among the images, and define a semi-supervised relational
topic model for image classification. Another popular approach maps
images and tags to a common semantic space, using CCA or kCCA [46, 23,
16, 1]. This line of work is closely related to our task; however,
these approaches only model user tags and assume static vocabularies.
In contrast, we show that our model can generalize to new types of
metadata.

Beyond images and tags. Besides user tags, previous work uses GPS and
timestamps [20, 35, 26, 48] to improve classification performance in
specific tasks such as landmark classification. Some authors model the
relations between images using multiple metadata [41, 37, 11, 28, 12].
Duan et al. [11] present a latent CRF model in which tags, visual
features and GPS-tags are used jointly for image clustering. McAuley
and Leskovec model pairwise social relations between images and then
apply a structural learning approach for image classification and
labeling [37]. They use this model to analyze the utility of different
types of metadata for image labeling. Our work is similarly motivated,
but their method does not use any visual representation. In contrast,
we use a deep neural network to blend the visual information of images
that share similar metadata.

3. Model 

We design a system that incorporates both visual features of images and
the neighborhoods in which they are embedded. An ideal system should be
able to handle different types of signals, and should be able to
generalize to new types of image metadata and adapt to their changes
over time (e.g. users add new tags or add images to photo-sets). To
this end we use metadata nonparametrically to generate image
neighborhoods, then operate on images together with their neighborhoods
using a parametric model. The entire model is summarized in Figure 2.

Let $X$ be a set of images, $Y$ a set of possible labels, and
$D = \{(x, y) \mid x \in X, y \subseteq Y\}$ a dataset associating each
image with a set of labels. Let $Z$ be a set of possible neighborhoods
for images; in our case a neighborhood is a set of related images, so
$Z$ is the power set $Z = 2^X$.

We use metadata to associate images with neighborhoods. A simple
approach would assign each image $x \in X$ to a single neighborhood
$z \in Z$; however there may be more than one useful neighborhood for
each image. As such, we instead use image metadata to generate a set
$Z_x \subseteq Z$ of candidate neighborhoods for each image $x$.

At training time, each element of the candidate set $Z_x$ for an image
$x$ is a set of training images, and $Z_x$ is computed using training
image metadata. At test time, test image metadata is used to build
$Z_x$ from test images; note that we do not use the training set at
test time.

Figure 2: Schematic of our model. To make predictions for an image, we
sample several of its nearest neighbors to form a neighborhood and we
use a CNN to extract visual features. We compute hidden state
representations for the image and its neighbors, then operate on the
concatenation of these two representations to compute class scores.

For an image $x \in X$ and neighborhood $z \in Z_x$ we use a function
$f$ parameterized by weights $w$ to predict label scores
$f(x, z; w) \in \mathbb{R}^{|Y|}$ for the image $x$. We average these
scores over all candidate neighborhoods for $x$, giving

$$ s(x; w) = \frac{1}{|Z_x|} \sum_{z \in Z_x} f(x, z; w). \qquad (1) $$

To train the model, we choose a loss $\ell$ and optimize:

$$ w^* = \arg\min_w \sum_{(x, y) \in D} \ell(s(x; w), y). \qquad (2) $$

The set $Z_x$ may be large, so for computational efficiency we
approximate $s(x; w)$ by sampling from $Z_x$. During training, we draw
a single sample during each forward pass and at test time we use ten
samples.
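
The sampling approximation can be illustrated with a small sketch. It assumes neighborhoods are drawn uniformly with replacement, which the text does not specify; `approx_scores`, `toy_f`, and the toy inputs are hypothetical names used only for illustration and stand in for the real score function.

```python
import random
from itertools import combinations


def approx_scores(x, candidates, f, num_samples, rng):
    """Estimate the averaged scores by sampling candidate neighborhoods.

    Assumption: neighborhoods are sampled uniformly with replacement.
    """
    sampled = [rng.choice(candidates) for _ in range(num_samples)]
    per_sample = [f(x, z) for z in sampled]
    num_labels = len(per_sample[0])
    return [sum(s[j] for s in per_sample) / num_samples for j in range(num_labels)]


def toy_f(x, z):
    """Hypothetical stand-in for the score function: two hand-made label scores."""
    return [x + sum(z), x - sum(z)]


# All size-2 subsets of four toy "neighbors".
candidates = list(combinations([1, 2, 3, 4], 2))

# Exact average over every candidate neighborhood.
exact = [sum(toy_f(10, z)[j] for z in candidates) / len(candidates) for j in range(2)]

# Approximation with ten samples, mirroring the test-time setting.
estimate = approx_scores(10, candidates, toy_f, num_samples=10, rng=random.Random(0))
```

With more samples the estimate approaches the exact average; a single sample, as used during training, gives a noisy but cheap estimate.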


3.1. Candidate Neighborhoods 

We generate candidate neighborhoods using a nearest-neighbor approach.
We use image metadata to compute a distance between each pair of
images. We fix a neighborhood size $m > 0$ and a max rank $M > m$; the
candidate neighborhoods $Z_x$ for an image $x$ then consist of all
subsets of size $m$ of the $M$-nearest neighbors to $x$.

The types of image metadata that we consider are user tags, image
photo-sets, and image groups. Sets are galleries of images collected by
the same user (e.g. pictures from the same event such as a wedding).
Image groups are community-curated; images belonging to the same
concept, scene or event are uploaded by the social network users. Each
type of metadata has a vocabulary $T$ of possible values, and
associates each image $x \in X$ with a subset $t_x \subseteq T$ of
values. For tags, $T$ is the set of all possible user tags and $t_x$
are the tags for image $x$; for groups (and sets), $T$ is the set of
all groups (sets), and $t_x$ are the groups (sets) to which $x$
belongs. For sets and groups, we use the entire vocabulary $T$; in the
case of tags we follow [37] and select only the $r$ most frequently
occurring tags on the training set.

We compute the distance between images using the Jaccard similarity
between their image metadata. Concretely, for $x, x' \in X$ we compute

$$ d(x, x') = 1 - \frac{|t_x \cap t_{x'}|}{|t_x \cup t_{x'}|}. \qquad (3) $$

To prevent an image from appearing in its own neighborhoods, we set
$d(x, x) = \infty$ for all $x \in X$.

Generating candidate neighborhoods introduces several hyperparameters,
namely the neighborhood size m, the max rank M, the type of metadata
used to compute distances, and the tag vocabulary size r. We show in
Section 5.2 that the type of metadata is the only hyperparameter that
significantly affects our performance.

3.2. Label Prediction 

Given an image $x \in X$ and a neighborhood
$z = \{z_1, \ldots, z_m\} \in Z_x$ from its candidate set, we design a
model that incorporates visual information from both the image and its
neighborhood in order to make predictions for the image. Our model is
essentially a fully-connected two-layer neural network applied to
features from the image and its neighborhood, except that we pool over
the hidden states for the neighborhood images.

We use a CNN [31, 27] $\phi$ to extract $d$-dimensional features from
the images $x$ and $z_i$. We compute an $h$-dimensional hidden state
for each image by applying an affine transform and an elementwise ReLU
nonlinearity $\sigma(\xi) = \max(0, \xi)$ to its features. To let the
model treat hidden states for the image and its neighborhood
differently, we apply distinct transforms to $\phi(x)$ and $\phi(z_i)$,
parameterized by $W_x \in \mathbb{R}^{h \times d}$,
$b_x \in \mathbb{R}^h$ and $W_z \in \mathbb{R}^{h \times d}$,
$b_z \in \mathbb{R}^h$.

At this point we have hidden states $v_x, v_{z_i} \in \mathbb{R}^h$ for
$x$ and each $z_i \in z$; to generate a single hidden state
$v_z \in \mathbb{R}^h$ for the neighborhood $z$ we pool each $v_{z_i}$
elementwise so that $(v_z)_j = \max_i (v_{z_i})_j$. Finally, to compute
label scores $f(x, z; w) \in \mathbb{R}^{|Y|}$ we concatenate $v_x$ and
$v_z$ and pass them through a third affine transform parameterized by
$W_y \in \mathbb{R}^{|Y| \times 2h}$, $b_y \in \mathbb{R}^{|Y|}$. To
summarize:

$$ v_x = \sigma(W_x \phi(x) + b_x) \qquad (4) $$

$$ v_z = \max_{i = 1, \ldots, m} \sigma(W_z \phi(z_i) + b_z) \qquad (5) $$

$$ f(x, z; w) = W_y \begin{bmatrix} v_x \\ v_z \end{bmatrix} + b_y \qquad (6) $$

The learnable parameters are $W_x$, $b_x$, $W_z$, $b_z$, $W_y$, and $b_y$.
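
The forward pass of Equations 4 to 6 can be sketched in plain Python. This is an illustrative toy, assuming the CNN features are already extracted and passed in as lists of floats; the helper names and the tiny hand-picked parameters are hypothetical and not the authors' implementation.

```python
def relu(v):
    """Elementwise max(0, value)."""
    return [max(0.0, a) for a in v]


def affine(W, b, v):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def label_scores(feat_x, feats_z, params):
    """Label scores for one image and one neighborhood of feature vectors."""
    Wx, bx, Wz, bz, Wy, by = params
    v_x = relu(affine(Wx, bx, feat_x))                        # Eq. 4
    hidden_z = [relu(affine(Wz, bz, feat)) for feat in feats_z]
    v_z = [max(column) for column in zip(*hidden_z)]          # Eq. 5: elementwise max
    return affine(Wy, by, v_x + v_z)                          # Eq. 6: concatenate, then affine


# Toy sizes chosen for illustration: d = 2 features, h = 2 hidden units, 1 label.
identity = [[1.0, 0.0], [0.0, 1.0]]
params = (identity, [0.0, 0.0], identity, [0.0, 0.0], [[1.0, 1.0, 1.0, 1.0]], [0.0])

scores = label_scores([1.0, -2.0], [[0.5, 3.0], [2.0, -1.0]], params)
```

Max pooling makes the neighborhood representation independent of the order of the neighbors, so the same parameters apply to any sampled neighborhood of any size.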

3.3. Learned Weights 

An example of a learned matrix $W_y$ is visualized in Figure 3. The
left and right sides multiply the hidden states for the image and its
neighborhood respectively. Both sides contain many nonzero weights,
indicating that the model learns to use information from both the image
and its neighborhood; however the darker coloration on the left
suggests that information from the image is weighted more heavily.

Figure 3: Learned weights $W_y$. The model uses features from both the
image and its neighbors. We show examples of images whose label scores
are influenced more by the image and by its neighborhood; images with
the same ground-truth labels are highlighted with the same colors.
Images that are influenced by their neighbors tend to be non-canonical
views.

We can follow this idea further, and use Equation 6 to compute for each
image the portion of its score for each label that is due to the hidden
state of the image $v_x$ and the portion due to that of its
neighborhood $v_z$. The left side of Figure 3 shows examples of
correctly labeled images whose scores are more due to the image, while
the right shows images more influenced by their neighborhoods. The
former show canonical views (such as a bride and groom for wedding)
while the latter are more non-canonical (such as a zebra crossing a
road).
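
One way to read this decomposition is to split each row of the output weight matrix into the half that multiplies the image hidden state and the half that multiplies the neighborhood hidden state. The sketch below assumes that reading; the function name and the toy numbers are hypothetical, and the paper does not spell out the exact computation.

```python
def score_contributions(Wy, v_x, v_z):
    """Split each label score (bias excluded) into image and neighborhood parts."""
    h = len(v_x)
    from_image = [sum(w * v for w, v in zip(row[:h], v_x)) for row in Wy]
    from_neighbors = [sum(w * v for w, v in zip(row[h:], v_z)) for row in Wy]
    return from_image, from_neighbors


# Toy example: two labels, hidden dimension 2 for each half.
Wy = [[1.0, 0.0, 0.5, 0.5],
      [0.0, 2.0, 1.0, 0.0]]
from_image, from_neighbors = score_contributions(Wy, v_x=[1.0, 3.0], v_z=[2.0, 4.0])
```

In this toy case the first label is driven mostly by the neighborhood and the second mostly by the image itself.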

3.4. Implementation details 

We apply L2 regularization to the matrices $W_x$, $W_z$, and $W_y$ and
apply dropout [22] with p = 0.5 to the hidden layers $v_x$ and $v_z$.
We initialize all parameters using the method of [21] and optimize
using stochastic gradient descent with a fixed learning rate,
RMSProp [42], and a minibatch size of 50. We train all models for 10
epochs, keeping the model snapshot that performs the best on the
validation set. For all
experiments we use a learning rate of 1 x 10“^, L 2 regular¬ 
ization strength 3 x 10“^ and hidden dimension h = 500; 
these values were chosen using grid search. 

Our image feature function $\phi$ returns the activations of the last
fully-connected layer of the BVLC Reference CaffeNet [24], which is
similar to the network architecture of [27]. We ran preliminary
experiments using features from the model of VGG [39], but this did not
significantly change the performance of our model. For all models our
loss function $\ell$ is a sum of independent one-vs-all logistic
classifiers.
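
A sum of independent one-vs-all logistic losses can be sketched as follows, assuming raw (pre-sigmoid) scores and binary targets per label. The function name is hypothetical, and the numerically stable formulation is a common implementation choice rather than something stated in the text.

```python
import math


def multilabel_logistic_loss(scores, targets):
    """Sum over labels of log(1 + exp(-margin)).

    The margin is +score for labels that are present and -score otherwise.
    """
    total = 0.0
    for score, present in zip(scores, targets):
        margin = score if present else -score
        # Stable evaluation of log(1 + exp(-margin)) for large |margin|.
        total += max(0.0, -margin) + math.log1p(math.exp(-abs(margin)))
    return total


uninformed = multilabel_logistic_loss([0.0, 0.0], [1, 0])
confident = multilabel_logistic_loss([10.0, -10.0], [1, 0])
mistaken = multilabel_logistic_loss([-10.0, 10.0], [1, 0])
```

Each label contributes independently, so an image with several ground-truth labels is handled without any normalization across labels.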

4. Experimental Protocol 
4.1. Dataset 

In all experiments we use the NUS-WIDE dataset [8], which has been
widely used for image labeling and retrieval. It consists of 269,648
images collected from Flickr, each manually annotated for the presence
or absence of 81 labels. Following [37] we augment the images with
metadata using the Flickr API, discarding images for which metadata is
unavailable. Following [15] we also discard images for which all labels
are absent. This leaves 190,253 images, which we randomly partition
into training, validation, and test sets of 110K, 40K, and 40,253
images respectively. We generate 5 such splits of the data and run all
experiments on all splits. Statistics of the dataset can be found in
Table 1. We will make our data and features publicly available to
facilitate future comparisons.


NUS-WIDE             Labels        Tags       Sets      Groups
# unique elements    81            10,000     165,039   95,358
# images per (.)     5701.3/1682   270.3/91   2.3/1     26.1/2
# (.) per image      2.4/2         14.2/11    2.0/1     13.1/8

Table 1: Dataset statistics. Image and (.) counts are reported in the
format mean / median.


4.2. Metrics 

Prior work uses a variety of metrics and experimental setups on
NUS-WIDE, making direct comparisons of results difficult. Following
prior work [36, 18, 44, 15] we assign a fixed number of labels to each
image and report (overall) precision Prec_I and recall Rec_I; we also
compute the precision and recall for each label and report the mean
across labels as the per-label metrics Prec_L, Rec_L.
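
The difference between the overall and per-label metrics can be made concrete with a small sketch. It assumes the usual definitions (overall metrics pool counts across all labels, per-label metrics average the per-label ratios); the function name, the zero-denominator convention, and the toy data are illustrative assumptions.

```python
def top_n_metrics(predicted, truth, labels):
    """Overall and mean per-label precision/recall for fixed-size predictions."""
    correct = {c: 0 for c in labels}
    n_pred = {c: 0 for c in labels}
    n_true = {c: 0 for c in labels}
    for pred, gt in zip(predicted, truth):
        for c in pred:
            n_pred[c] += 1
            if c in gt:
                correct[c] += 1
        for c in gt:
            n_true[c] += 1

    def ratio(num, den):
        # Assumption: a label never predicted (or never present) contributes 0.
        return num / den if den else 0.0

    total = sum(correct.values())
    return {
        "prec_overall": ratio(total, sum(n_pred.values())),
        "rec_overall": ratio(total, sum(n_true.values())),
        "prec_per_label": sum(ratio(correct[c], n_pred[c]) for c in labels) / len(labels),
        "rec_per_label": sum(ratio(correct[c], n_true[c]) for c in labels) / len(labels),
    }


# Toy data: one predicted label per image, four images, two labels.
predicted = [["sky"], ["sky"], ["sky"], ["dog"]]
truth = [{"sky"}, {"sky"}, {"dog"}, {"dog"}]
metrics = top_n_metrics(predicted, truth, labels=["sky", "dog"])
```

Here the overall precision is 3/4 while the mean per-label precision is 5/6, because the per-label average weights the rarely predicted label as heavily as the frequently predicted one.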

NUS-WIDE has a highly uneven distribution of labels; the most common
(sky) has over 68,000 examples and the least common (map) has only 53.
As a result the overall precision and recall statistics are strongly
biased towards the common labels. The precision and recall for uncommon
labels are extremely noisy since they are based on only a handful of
test-set examples, and the mean per-label statistics inherit this noise
since they weight all classes equally.

Figure 4: Example results. For each image we show the top 3 scoring
labels using the visual-only (V-only) model and our model using tag
nearest neighbors; correct labels are shown in blue and incorrect
labels in red. We also show the 6 nearest neighbors to each image; its
neighborhoods are drawn from these images. The red dashed lines show
failure cases.

Method                                  mAP_L        mAP_I        Rec_L        Prec_L       Rec_I        Prec_I
Tag-only Model + linear SVM [37]        46.67        -            -            -            -            -
Graphical Model (all metadata) [37]     49.00        -            -            -            -            -
CNN + softmax [15]                      -            -            31.22        31.68        59.52        47.82
CNN + ranking [15]                      -            -            26.83        31.93        58.00        46.59
CNN + WARP [15]                         -            -            35.60        31.65        60.49        48.59
Upper bound                             100.00±0.00  100.00±0.00  68.52±0.35   60.68±1.32   92.09±0.10   66.83±0.12
Tag-only + logistic                     43.88±0.32   77.06±0.14   47.52±2.59   46.83±0.89   71.34±0.16   51.18±0.16
CNN [27] + kNN-voting [36]              44.03±0.26   73.72±0.10   30.83±0.37   44.41±1.05   68.06±0.15   49.49±0.11
CNN [27] + logistic (visual-only)       45.78±0.18   77.15±0.11   43.12±0.39   40.90±0.39   71.60±0.19   51.56±0.11
Image neighborhoods + CNN-voting        50.40±0.23   77.86±0.15   34.52±0.47   56.05±1.47   72.12±0.21   51.91±0.20
Our model: tag neighbors                52.78±0.34   80.34±0.07   43.61±0.47   46.98±1.01   74.72±0.16   53.69±0.13
Our model: tag neighbors + tag vector   61.88±0.36   80.27±0.08   57.30±0.44   54.74±0.63   75.10±0.20   53.46±0.09

Table 2: Results on NUS-WIDE. Precision and recall are measured using
n = 3 labels per image. Metrics are reported both per-label (mAP_L) and
per-image (mAP_I). We run on 5 splits of the data and report mean and
standard deviation.

Mean Average Precision (mAP) is another widely used metric [37, 34]; it
directly measures ranking quality, so it naturally handles multiple
labels and does not require choosing a fixed number of labels per
image. As with other metrics, we report mAP both per-label (mAP_L) and
per-image (mAP_I). mAP_L is less noisy and hence preferable to other
per-label metrics since it considers the full ranking of images instead
of only the top labels for each image.
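
The two flavors of mAP can be sketched with one helper applied to either the rows or the columns of a score matrix. This is an illustrative sketch using the standard definition of average precision; the function names, the convention of returning 0 when there are no positives, and the toy scores are assumptions made for the example.

```python
def average_precision(ranked_relevance):
    """Mean of precision-at-rank over the ranks where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0  # assumption: no positives gives AP 0


def mean_ap(score_rows, truth_rows):
    """Rank each row by descending score and average the per-row AP."""
    aps = []
    for scores, truth in zip(score_rows, truth_rows):
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        aps.append(average_precision([truth[i] for i in order]))
    return sum(aps) / len(aps)


# Toy data: rows are images, columns are labels.
scores = [[0.9, 0.2, 0.6],
          [0.1, 0.8, 0.7]]
truth = [[1, 0, 1],
         [1, 1, 0]]

map_per_image = mean_ap(scores, truth)                          # rank labels per image
map_per_label = mean_ap(list(zip(*scores)), list(zip(*truth)))  # rank images per label
```

The only difference between the two metrics in this sketch is whether the score matrix is traversed by rows or by columns.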

5. Experiments 

5.1. Multilabel Image Annotation 

We show that our model achieves state-of-the-art results for multilabel
image annotation on NUS-WIDE. Our best model computes neighborhoods
using tags with a vocabulary size of r = 5000, neighborhood size m = 3
and max-rank M = 6. Preliminary experiments at combining all types of
metadata did not show improvements over using tags alone. We also show
the result of augmenting the hidden state of our model with a binary
indicator vector of image tags. All results are shown in Table 2.

Baselines. First we report the results of McAuley and Leskovec [37] and
Gong et al. [15] as in their original papers. Then we compare our model
with four baselines:

1. Tag-only + logistic: the tag-only model of [37] represents each
image with a sparse binary vector indicating its tags, while their full
model uses all available metadata (tags, groups, galleries, and sets)
and incorporates a graphical model to model pairwise interactions
between these features. Unfortunately these results are not directly
comparable to ours, since they do not discard images without
ground-truth labels; as a result they use 244K images for their
experiments while we use only 190K. We reimplement a version of their
tag-only model by training one-vs-all logistic classifiers on top of
binary tag indicator features. Our reimplementation performs slightly
worse than their reported numbers due to the difference in dataset
size.

2. CNN + logistic loss: the results of [15] have been obtained using
deep convolutional neural networks in the style of [27] equipped with
various multilabel loss functions. Again, these results are not
directly comparable to ours because they train their networks from
scratch on the NUS-WIDE dataset, while we use networks that were
pretrained on ImageNet [9]. We reimplement a version of their model by
training one-vs-all logistic classifiers using the features extracted
from our pretrained network. This is an extremely strong baseline; note
that it already outperforms [15], highlighting the power of the
pretrained network.

3. CNN + kNN voting: as an additional baseline we implement a simple
nearest-neighbor approach. For each test image we compute the L2
distance between its CNN features and the features of all images in the
training set; the ground-truth labels of the retrieved training images
are then used in a voting scheme similar to [36, 33].

4. Image neighborhoods + CNN-voting: for each test image we compute its
M-nearest neighbors on the test set using user tags as in our full
model, but instead of passing these neighbors to our parametric model
we apply the CNN + logistic visual-only model to the image and its
neighbors. Then we set the label scores of the test image to be a
weighted sum of its visual-only label scores and the mean of the
visual-only label scores of its neighbors.
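
The fourth baseline's score blending can be sketched in a few lines. The text says only that a weighted sum is used; the convex combination with a single `weight` parameter below is an assumption made for illustration, as are the function name and the toy scores.

```python
def neighborhood_vote(own_scores, neighbor_scores, weight):
    """Blend an image's visual-only scores with the mean scores of its neighbors.

    Assumption: the two terms are weighted by `weight` and `1 - weight`.
    """
    mean_neighbor = [sum(column) / len(column) for column in zip(*neighbor_scores)]
    return [weight * own + (1.0 - weight) * nb
            for own, nb in zip(own_scores, mean_neighbor)]


# Toy example with two labels and two neighbors.
blended = neighborhood_vote(
    own_scores=[0.2, 0.8],
    neighbor_scores=[[1.0, 0.0], [0.6, 0.4]],
    weight=0.5,
)
```

Unlike the full model, this baseline has no learned interaction between image and neighborhood features; the neighbors only shift the final scores.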

Upper bound. As discussed in Section 4.2, we assign the top n = 3
labels to each image and report precision both per-class and per-image
(recall that the average number of labels per image is approximately
2.4). However many images do not have exactly 3 ground-truth labels;
this means that no classifier can achieve unit precision and recall. To
estimate upper bounds for these metrics, we train one-vs-all logistic
classifiers where each image is represented by a binary indicator
vector encoding its ground-truth labels. As seen in Table 2, even this
perfect classifier achieves far from perfect performance on many of the
evaluation metrics.
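
A simple counting argument shows why fixed-size predictions cap precision and recall. The sketch below is an idealized calculation for a hypothetical oracle that always ranks true labels first; it is not the procedure used in the paper, which estimates the bound by training classifiers on ground-truth indicator vectors.

```python
def oracle_precision_recall(num_true_labels, n=3):
    """Overall precision and recall of an oracle forced to output n labels per image.

    `num_true_labels` holds the number of ground-truth labels of each image.
    """
    correct = sum(min(k, n) for k in num_true_labels)
    precision = correct / (n * len(num_true_labels))
    recall = correct / sum(num_true_labels)
    return precision, recall


# Three hypothetical images with 1, 2 and 5 ground-truth labels.
precision, recall = oracle_precision_recall([1, 2, 5], n=3)
```

Images with fewer than n labels force wrong predictions and lower precision, while images with more than n labels leave true labels unassigned and lower recall.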

Results. Table 2 shows that our model outperforms prior work on nearly
all metrics. The per-class precision and recall metrics display high
variance; as a result we do not believe them to be the best indicators
of performance. The mAP metrics give a clearer picture of performance,
since they display lower variance and do not rely on annotating each
image with a fixed number of labels. On these metrics our model
outperforms all baselines by a significant margin.

Figure 5: (a) Our model shows large improvements for labels with high
intra-class variability (e.g. wedding) and for labels where the visual
model performs well (e.g. food). (b) Left: AP for each label of our
model and the baseline; we improve for all but three labels (map,
earthquake, rainbow). (b) Right: difference in class AP between our
model and the visual-only model vs. label frequency.

As an extension, we append the binary tag vector to the representation
learned by our model (tag neighbors + tag vector); this does not
significantly change performance as measured by per-image metrics, but
does show improvement on per-class metrics. This suggests that the
binary tag vector is especially useful for rare classes which may have
strong correlations with certain user tags. Although it increases
per-class performance, this extension significantly increases the
number of learnable parameters and makes generalization to new types of
metadata impossible.

In order to qualitatively understand some of the cases where our model
outperforms the baselines, Figure 4 compares the top three labels
produced by our model and by the visual-only baseline. The additional
visual information provided by the neighborhoods can help resolve
ambiguities in non-canonical views; for example in the image of
swimmers the visual-only model appears to mistake the colorful swim
caps for flowers, but the neighborhood provides canonical views of
swimmers.

In a few cases the neighborhood can hurt performance. For example in
the image of the boy with a dog, the visual-only model correctly
produces a dog label but our model replaces this with a water label,
likely because no neighbors contain dogs but two neighbors contain
visible bodies of water. However the aggregate metrics of Table 2 make
it clear that neighborhoods are beneficial more often than not.

Figure 6: Probability that the kth nearest neighbor of an image has a
particular label given that the image has the label, as a function of k
and using different metadata. The dashed lines give the overall
probability that an image has the label. Across all metadata and all
classes, an image and its neighbors are likely to share labels.

Method                               mAP_L        mAP_I
CNN [27] + logistic (visual-only)    45.78±0.18   77.15±0.11
Our model: visual neighbors          47.45±0.19   78.56±0.14
Our model: group neighbors           48.87±0.22   79.11±0.13
Our model: set neighbors             48.02±0.33   78.40±0.25
Our model: tag neighbors             52.78±0.34   80.34±0.07

Table 3: Our model trained with different image neighborhoods, compared
with the visual-only model.

Figure 7: Performance of our model as we vary the neighborhood size m,
max-rank M, and tag vocabulary size r. In all cases our model
outperforms the baselines.

There are cases where both models fail; for example see the lower right
image of Figure 4 which shows a person crouching inside a statue of a
rabbit. The ground-truth labels for this challenging image are statue
and person, which are produced by neither model.

More quantitatively, Figure 5b compares the average precision (AP) of
both our model and the visual-only baseline for each label; our model
outperforms the baseline on all but three labels: map, earthquake, and
rainbow. Of these, map is the only label where our model is
significantly outperformed by the baseline. Figure 5b also reveals that
these three labels are among the most infrequent; they have only 53,
56, and 397 instances respectively in the entire dataset, and an
average of only 12.8, 13.2, and 82.0 instances respectively on the test
sets. With so few test instances the performance of both models on
these labels is highly susceptible to noise. It is also interesting to
note that our model gives its largest boost in performance for labels
of intermediate frequency, while for the very frequent labels it is
still able to give slight improvements.

Figure 5a also shows two example precision-recall curves. The wedding
label has high intra-class variability, making it difficult to
recognize using visual features alone; our model is able to give a
large boost in performance by taking advantage of image metadata. Our
model also gives improvements on labels such as food where the
performance of the visual-only baseline is already quite strong.

5.2. Neighborhood Hyperparameters 

Our method for generating image neighborhoods introduces several
hyperparameters: the type of metadata used, the size m of each
neighborhood, the max-rank M for neighbors, and the tag-vocabulary size
r. Here we explore the influence of these hyperparameters on our model.

Effects on performance. The most important hyperparameter for
generating neighborhoods is the type of data used. We show in Table 3
the performance of our model using different types of metadata: tags
give the highest performance, followed by groups and then sets. In all
cases our model outperforms the visual-only baseline. We also show the
effect of using Euclidean distance of visual features to build
neighborhoods (visual neighbors). This setup slightly outperforms the
visual-only baseline but is outperformed when using metadata, showing
both the ability of our method to handle a variety of neighbor types,
and the importance of image metadata.

To study the effects of the neighborhood size m, the max-rank M, and
the tag vocabulary size r we show in Figure 7 the performance of our
model as we vary these hyperparameters. Varying the max-rank M gives
the largest variation in performance, but in all cases we show
improvements over the visual-only baseline and the results of [37].

Label correlations. We can better interpret the influence of
neighborhood hyperparameters by studying the correlations between the
labels of images and their nearest neighbors. With strong correlations,
visual evidence for a label among an image’s neighbors is evidence that
the image should have the same label; as such, our model should perform
better when these correlations are stronger.

To this end, we plot in Figure 6 the probability that the kth nearest
neighbor of an image has a particular label given that the image itself
has the label; on the same axis we show the baseline probability that a
random image in the dataset has the label. This experiment shows that
the nearest neighbors of images are indeed very likely to share labels
with an image, and helps to explain the influence of various
hyperparameters. An image’s labels are most highly correlated with its
tag neighbors, followed by groups and then sets; this matches the
results of Table 3. The flat shape of all curves in Figure 6 indicates
that the 20th nearest neighbor is nearly as informative as the 10th,
suggesting that larger max-ranks M may increase performance.
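
The conditional probability plotted in Figure 6 can be estimated by simple counting. The sketch below assumes labels are stored as sets and that each image has a precomputed, ordered neighbor list; the function name and the toy data are hypothetical.

```python
def neighbor_label_probability(labels, neighbors, label, k):
    """Estimate P(kth nearest neighbor has `label` | image has `label`)."""
    have = [img for img in labels if label in labels[img]]
    shared = sum(1 for img in have if label in labels[neighbors[img][k - 1]])
    return shared / len(have)


# Toy data: image -> label set, and image -> neighbors ordered by distance.
labels = {"a": {"sky"}, "b": {"sky", "water"}, "c": {"water"}, "d": {"sky"}}
neighbors = {"a": ["b", "c"], "b": ["a", "c"], "c": ["b", "a"], "d": ["c", "a"]}

p_first = neighbor_label_probability(labels, neighbors, "sky", k=1)
p_second = neighbor_label_probability(labels, neighbors, "sky", k=2)
base_rate = sum(1 for img in labels if "sky" in labels[img]) / len(labels)
```

Comparing these estimates against the base rate, for each k and each type of metadata, gives curves of the kind described above.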

5.3. Generalization Experiments 

One advantage of our model is that we only use metadata of images
nonparametrically as a means to compute image neighborhoods. As a
result, our model can easily cope with situations where different types
of metadata are available during training and testing.

Figure 8: Performance as we vary overlap between tag vocabularies used
for training and testing. Our model gives strong results even in the
case of disjoint vocabularies.


Train \ Test   Tags          Sets          Groups
Tags           52.78±0.34    47.12±0.35    48.14±0.33
Sets           52.21±0.29    48.02±0.33    48.49±0.16
Groups         50.32±0.28    47.82±0.24    48.87±0.22


Vocabulary Generalization. Our best-performing model 
relies on user tags to generate image neighborhoods. In a 
real-world setting, the vocabulary of user tags may change 
over time: new tags may become popular, and older tags 
may fall into disuse. Any method that depends on user tags 
should be able to cope with these challenges. 

Ideally, to test our model’s resilience to changes in user 
tags over time, we would train the model using a snapshot of 
Flickr images at one point in time and test the model using 
a snapshot from a different point in time. 

Unfortunately we do not have access to this type of data. As a proxy to
such an experiment, we instead randomly divide the 10K most commonly
occurring user tags into two sets. During training we use the first set
of user tags to generate neighborhoods, and use the second during
testing. We vary the degree to which the training tags and the testing
tags overlap; with an overlap of 0% there are no tags shared between
training and testing, and an overlap of 100% uses the same vocabulary
of user tags for training and testing. Results are shown in Figure 8.
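
One possible way to build such vocabularies is sketched below. The exact splitting procedure is not described in the text, so the equal-sized halves, the interpretation of overlap as the fraction of the test vocabulary shared with training, and all names are assumptions made for illustration.

```python
import random


def split_vocabulary(tags, overlap, rng):
    """Return (train_vocab, test_vocab) of equal size sharing a fraction of tags.

    Assumption: `overlap` is the fraction of the test vocabulary that also
    appears in the training vocabulary.
    """
    shuffled = list(tags)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    train, held_out = shuffled[:half], shuffled[half:]
    n_shared = round(overlap * half)
    test = train[:n_shared] + held_out[:half - n_shared]
    return set(train), set(test)


# Hypothetical vocabulary of 100 tags.
tags = [f"tag{i}" for i in range(100)]
train_0, test_0 = split_vocabulary(tags, 0.0, random.Random(0))
train_50, test_50 = split_vocabulary(tags, 0.5, random.Random(0))
train_100, test_100 = split_vocabulary(tags, 1.0, random.Random(0))
```

Neighborhoods would then be computed from the training vocabulary at training time and from the test vocabulary at test time, with no other change to the model.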

We see that the performance of our model degrades as we 
decrease the overlap between the training and testing tags; 
however even in the case of 0% overlap our model is able to 
outperform both the visual-only model and [37]. 

Metadata Generalization. As a test of our model’s ability to generalize
across different types of metadata, we perform an experiment where we
use different types of metadata during training and testing. For
example, we generate neighborhoods with tags during training and
instead use sets during testing. Table 4 shows the quantitative results
of this experiment; in all cases our model outperforms the visual-only
baseline. These results suggest that our model could be applied in
cases where some types of metadata are unavailable during testing.


Table 4: Metadata generalization experiment. We use different types of
metadata during training and testing, and report mAP_L for all possible
pairs. All combinations outperform the visual-only model (45.78±0.18).

We can explain the results of this experiment by again examining
Figure 6. When we train using one signal and test using another, our
train and test data are no longer drawn from the same distribution,
breaking one of the core assumptions of supervised learning. However
the parametric portion of our model only views image metadata through
the lens of nearest neighbors; Figure 6 shows that changing the method
of computing these neighbors does not drastically change the nature of
the correlations between the labels of an image and its neighbors.

6. Conclusion 

We have introduced a framework that exploits image metadata to generate
neighborhoods of images, and uses a strong parametric visual model
based on deep convolutional neural networks to blend visual information
between an image and its neighbors. We use our model to achieve
state-of-the-art performance for multilabel image annotation on the
NUS-WIDE dataset. We also show that our model gives impressive results
even when it is forced to generalize to new types of metadata at test
time.